Research question
The survey question “Will People Born Today Have a Better Life Than Their Parents?” allows only binary answers, “Yes” or “No”, and is coded in the dataset the same way.
Rather than building a model from random good-looking predictors, I will run a small, focused study on a specific topic: how a respondent’s level of expertise (professional coding experience and level of influence at work) and the structuredness of their work influence the answer to the outcome question.
I think parameters such as expertise and structuredness can reflect a respondent’s confidence, and therefore I assume they also influence the evaluation of “how hard life is for me and the people around me compared with our parents”. The logic is simple: if you ‘fit’ well in life, feel confident and your work is structured, you may think your life is better than your parents’ life.
Let’s test that hypothesis!
Predictors
I will use three predictors in my main model: YearsCodePro, WorkPlan, PurchaseWhat.
Continuous:
- YearsCodePro: How many years have you coded professionally (as a part of your work)?

Categorical:
- WorkPlan: How structured or planned is your work?
- PurchaseWhat: What level of influence do you, personally, have over new technology purchases at your organization?
Now I will load the data and relevel the factors, then recode them with shorter names, which makes them easier to read and interpret.
In preprocessing:
- the dataset was filtered by age to the range 18-60 to avoid outliers
- the predictor YearsCodePro was filtered to exclude the values “Less than 1 year” and “More than 50 years” and then recoded to numeric
- the numeric predictor YearsCodePro was then centered and scaled
- all NAs in the predictors PurchaseWhat and WorkPlan were recoded to a “Non-responce” level
- lastly, all variables except YearsCodePro (including the outcome) were recoded to factors:
- the reference level of WorkPlan is “There’s no schedule or spec; I work on what seems most important or urgent”
- the reference level of PurchaseWhat is “I have little or no influence”
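The steps above can be sketched roughly as follows. This is a minimal sketch on a toy data frame: the column names follow the survey, but the values and short level labels here are made up for illustration.

```r
# Toy stand-in for the survey data frame (hypothetical values)
df <- data.frame(
  Age          = c(25, 17, 40, 65),
  YearsCodePro = c("5", "Less than 1 year", "12", "3"),
  WorkPlan     = c("-lowStrct", "-medStrct", NA, "-highStrct"),
  stringsAsFactors = FALSE
)

df <- df[df$Age >= 18 & df$Age <= 60, ]                      # keep ages 18-60
df <- df[!df$YearsCodePro %in%
           c("Less than 1 year", "More than 50 years"), ]    # drop text answers
df$YearsCodePro <- as.numeric(scale(as.numeric(df$YearsCodePro)))  # center + scale
df$WorkPlan[is.na(df$WorkPlan)] <- "Non-responce"            # explicit NA level
df$WorkPlan <- relevel(factor(df$WorkPlan), ref = "-lowStrct")
```

After these steps the continuous predictor has mean zero and the categorical predictors have no NAs, only an explicit “Non-responce” level.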
Model equation
This is what my model’s equation looks like; I’m not interested in any interactions, only in estimating the direct effects on the outcome:
\[\log\left[\frac{P(BetterLife)}{1 - P(BetterLife)}\right] = \beta_{0} + \beta_{1} \cdot YearsCodePro + \beta_{2} \cdot WorkPlan + \beta_{3} \cdot PurchaseWhat\]
Data description
So here is a quick description of the chosen variables:
##
## Descriptive statistics by group
## group: No
## vars n mean sd median trimmed mad min max range skew
## BetterLife* 1 23242 1.00 0.00 1.00 1.00 0.00 1.00 1.00 0 NaN
## YearsCodePro 2 23242 0.67 7.51 -2.31 -0.49 5.93 -7.31 41.69 49 1.39
## WorkPlan* 3 23242 2.00 0.85 2.00 1.92 1.48 1.00 4.00 3 0.65
## PurchaseWhat* 4 23242 2.08 1.05 2.00 1.97 1.48 1.00 4.00 3 0.62
## kurtosis se
## BetterLife* NaN 0.00
## YearsCodePro 1.65 0.05
## WorkPlan* -0.08 0.01
## PurchaseWhat* -0.83 0.01
## ------------------------------------------------------------
## group: Yes
## vars n mean sd median trimmed mad min max range skew
## BetterLife* 1 38735 2.00 0.00 2.00 2.00 0.00 2.00 2.00 0 NaN
## YearsCodePro 2 38735 -0.41 6.67 -2.31 -1.51 4.45 -7.31 37.69 45 1.56
## WorkPlan* 3 38735 2.03 0.83 2.00 1.97 1.48 1.00 4.00 3 0.53
## PurchaseWhat* 4 38735 2.11 1.04 2.00 2.01 1.48 1.00 4.00 3 0.56
## kurtosis se
## BetterLife* NaN 0.00
## YearsCodePro 2.53 0.03
## WorkPlan* -0.21 0.00
## PurchaseWhat* -0.88 0.01
From the survey’s report we already know that 63.7% chose “Yes” and 36.3% chose “No” on the outcome question, so the simplest model would give a probability of 0.637 of obtaining “Yes” as the answer.
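That baseline is easy to verify on the logit scale: the intercept-only model encodes nothing more than the marginal “Yes” share.

```r
# The empty model's intercept is the log-odds of the marginal "Yes" share;
# plogis() inverts the logit and recovers the baseline probability.
p_yes    <- 0.637
log_odds <- log(p_yes / (1 - p_yes))   # about 0.56
plogis(log_odds)                       # recovers 0.637
```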
Interesting points from this data description:
- looking at the centrality measures of the (centered!) YearsCodePro predictor, the mean is 0.67 in the “No” group and -0.41 in the “Yes” group. From that I can assume that respondents in the “No” group generally have more professional coding experience than those in the “Yes” group. This goes against my hypothesis about years of coding, so I will need to check it in the logistic regression model later.
For the factor variables such a table is not very informative, so I will use plots and cross-tables to check them.
Checking for missing values
Only about 1 percent of the data is NA, and only in the outcome variable; all NAs in the predictors were recoded to “Non-responce”.
I will still omit that 1 percent of NAs so the model gets really clean data.
Variables scales and distributions
YearsCodePro: How many years have you coded professionally (as a part of your work)?
As can be seen, the bulk of the distribution sits to the left of zero with a long right tail: the distribution is highly skewed, but both groups show the same skew.
WorkPlan: How structured or planned is your work?
Here the proportions of “Yes” and “No” respondents are similar in each work-planning group except “Non-responce”. It can also be seen that the proportion shifts slightly in favor of “Yes” as the level of work structuredness rises, which helps my hypothesis a little.
PurchaseWhat: What level of influence do you, personally, have over new technology purchases at your organization?
Here the proportions are similar and shift in favor of the “Yes” group even more slightly, so this does not really help my hypothesis, but I will see it in the model.
Checking distributions per category
I can also look at xtabs for a better view of the distribution between groups:
## WorkPlan
## BetterLife -lowStrct -medStrct -highStrct Non-responce
## No 6865 11086 3758 1533
## Yes 10540 18567 7449 2179
## PurchaseWhat
## BetterLife -lowInfl -medInfl -highInfl Non-responce
## No 8513 7881 3385 3463
## Yes 13587 13113 6405 5630
We have more respondents with lower levels of structuredness and influence at work; I assume we simply have more junior, younger specialists than senior ones.
We can also notice that the higher the levels of structuredness and influence at work, the higher the “Yes”/“No” ratio, reaching almost 2:1 at the “high” levels. So that part of my hypothesis is slightly supported.
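The “almost 2:1” claim can be read straight off the xtabs counts above:

```r
# "Yes"/"No" ratios per category, computed from the xtabs counts above
workplan_ratio <- c(low = 10540 / 6865, med = 18567 / 11086, high = 7449 / 3758)
purchase_ratio <- c(low = 13587 / 8513, med = 13113 / 7881, high = 6405 / 3385)
round(workplan_ratio, 2)   # 1.54 1.67 1.98 -- rises with structuredness
round(purchase_ratio, 2)   # 1.60 1.66 1.89 -- rises with influence
```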
Choosing the best model
Before building a model I will split the dataset into train/test subsets so I can assess accuracy after modelling:
bound <- floor((nrow(df)/4)*3)
set.seed(1313)
df <- df[sample(nrow(df)),]
df.train <- df[1:bound, ]
df.test <- df[(bound+1):nrow(df), ]

And now let’s build models with different combinations of predictors (and an empty model too) and choose the best one:
# main model with 3 predictors
model <- glm(BetterLife ~ YearsCodePro + WorkPlan + PurchaseWhat, family = "binomial", data = df.train)
# side models with 2 predictors
model_YW <- glm(BetterLife ~ YearsCodePro + WorkPlan, family = "binomial", data = df.train)
model_YP <- glm(BetterLife ~ YearsCodePro + PurchaseWhat, family = "binomial", data = df.train)
model_WP <- glm(BetterLife ~ WorkPlan + PurchaseWhat, family = "binomial", data = df.train)
# side models with 1 predictor
model_Y <- glm(BetterLife ~ YearsCodePro, family = "binomial", data = df.train)
model_W <- glm(BetterLife ~ WorkPlan, family = "binomial", data = df.train)
model_P <- glm(BetterLife ~ PurchaseWhat, family = "binomial", data = df.train)
# empty model
model_0 <- glm(BetterLife ~ 1, family = "binomial", data = df.train)

First, I will look at the log-likelihood and prefer the model whose value is closest to zero:
## Likelihood ratio test
##
## Model 1: BetterLife ~ YearsCodePro + WorkPlan
## Model 2: BetterLife ~ YearsCodePro + PurchaseWhat
## Model 3: BetterLife ~ WorkPlan + PurchaseWhat
## Model 4: BetterLife ~ YearsCodePro
## Model 5: BetterLife ~ WorkPlan
## Model 6: BetterLife ~ PurchaseWhat
## Model 7: BetterLife ~ YearsCodePro + WorkPlan + PurchaseWhat
## Model 8: BetterLife ~ 1
## #Df LogLik Df Chisq Pr(>Chisq)
## 1 5 -30602
## 2 5 -30606 0 8.8352 < 2.2e-16 ***
## 3 7 -30687 2 162.4324 < 2.2e-16 ***
## 4 2 -30646 -5 83.3000 < 2.2e-16 ***
## 5 4 -30713 2 134.5126 < 2.2e-16 ***
## 6 4 -30744 0 61.7700 < 2.2e-16 ***
## 7 8 -30555 4 377.1831 < 2.2e-16 ***
## 8 1 -30766 -7 422.2071 < 2.2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
By log-likelihood I see that the most complex model (seventh in the list) has the best value, and all my models are significant.
library(rcompanion)
compareGLM(model_YW, model_YP, model_WP, model_Y, model_W, model_P, model, model_0)

## $Models
## Formula
## 1 "BetterLife ~ YearsCodePro + WorkPlan"
## 2 "BetterLife ~ YearsCodePro + PurchaseWhat"
## 3 "BetterLife ~ WorkPlan + PurchaseWhat"
## 4 "BetterLife ~ YearsCodePro"
## 5 "BetterLife ~ WorkPlan"
## 6 "BetterLife ~ PurchaseWhat"
## 7 "BetterLife ~ YearsCodePro + WorkPlan + PurchaseWhat"
## 8 "BetterLife ~ 1"
##
## $Fit.criteria
## Rank Df.res AIC AICc BIC McFadden Cox.and.Snell Nagelkerke p.value
## 1 5 46480 61220 61220 61270 0.0053510 0.0070590 0.009619 2.597e-70
## 2 5 46480 61220 61220 61280 0.0052080 0.0068700 0.009361 2.095e-68
## 3 7 46480 61390 61390 61460 0.0025680 0.0033940 0.004624 7.630e-32
## 4 2 46480 61300 61300 61320 0.0039220 0.0051780 0.007056 1.025e-54
## 5 4 46480 61440 61440 61480 0.0017360 0.0022950 0.003127 2.662e-23
## 6 4 46480 61500 61500 61540 0.0007317 0.0009682 0.001319 4.475e-10
## 7 8 46470 61130 61130 61210 0.0068620 0.0090420 0.012320 2.030e-87
## 8 1 46480 61540 61540 61550 0.0000000 0.0000000 0.000000 Inf
By all three pseudo-R² measures the seventh, most complex model is again the best, although every model, including the best one, has really bad pseudo-R² values below 0.01. So all of my models are weak, but I can still compare them and pick the best of a bad bunch.
Based on the log-likelihood and the pseudo-R² coefficients I choose the seventh, most complex model and continue the analysis with it.
Model summary and interpretation of coefficients
library(jtools) # https://cran.r-project.org/web/packages/jtools/vignettes/summ.html
summ(model, confint = T, digits = 2, exp = T, vifs = T)

## MODEL INFO:
## Observations: 46482
## Dependent Variable: BetterLife
## Type: Generalized linear model
## Family: binomial
## Link function: logit
##
## MODEL FIT:
## χ²(7) = 422.21, p = 0.00
## Pseudo-R² (Cragg-Uhler) = 0.01
## Pseudo-R² (McFadden) = 0.01
## AIC = 61126.18, BIC = 61196.15
##
## Standard errors: MLE
## --------------------------------------------------------------------------------
## exp(Est.) 2.5% 97.5% z val. p VIF
## ------------------------------ ----------- ------ ------- -------- ------ ------
## (Intercept) 1.40 1.34 1.46 15.09 0.00
## YearsCodePro 0.98 0.98 0.98 -16.28 0.00 1.04
## WorkPlan-medStrct 1.10 1.05 1.15 4.15 0.00 1.24
## WorkPlan-highStrct 1.29 1.22 1.37 8.76 0.00 1.24
## WorkPlanNon-responce 0.88 0.80 0.96 -2.77 0.01 1.24
## PurchaseWhat-medInfl 1.07 1.03 1.12 3.09 0.00 1.28
## PurchaseWhat-highInfl 1.31 1.24 1.39 9.17 0.00 1.28
## PurchaseWhatNon-responce 1.18 1.11 1.26 5.05 0.00 1.28
## --------------------------------------------------------------------------------
Odds ratios
All interpretations refer to the odds of answering “Yes”:
- When YearsCodePro increases by one unit, the odds of BetterLife multiply by 0.98, i.e. they decrease by about 2 percent [95% CI: 0.97; 0.98]
- When WorkPlan changes from “low” to:
  - “med”: the odds increase by 10%
  - “high”: the odds increase by 29%
  - “Non-responce”: the odds decrease by 12%
- When PurchaseWhat changes from “low” to:
  - “med”: the odds increase by 7%
  - “high”: the odds increase by 31%
  - “Non-responce”: the odds increase by 18%
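A quick way to keep this interpretation straight: exp(beta) is a multiplier on the odds, and (exp(beta) - 1) * 100 is the percent change. The coefficients below are hypothetical values chosen only to reproduce the rounded odds ratios from the summ() table.

```r
# exp(beta) multiplies the odds; (exp(beta) - 1) * 100 is the percent
# change in the odds. The coefficients here are hypothetical values
# picked to land on the rounded ORs shown in the summary table.
beta <- c(YearsCodePro = -0.022, WorkPlan_highStrct = 0.255)
or   <- exp(beta)
round(or, 2)               # 0.98 and 1.29, as in the table
round((or - 1) * 100, 1)   # about -2.2% and +29.0% change in the odds
```

Note that an OR of 1.29 means a 29% increase, not a 129% one, and an OR of 0.88 means a 12% decrease, not an 88% one.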
Confidence intervals
No predictor’s confidence interval crosses 1!
Predicted probabilities
For different factor predictors combinations:
pred_data <- with(df, data.frame(YearsCodePro = mean(YearsCodePro),
wp = expand.grid(unique(df$WorkPlan), unique(df$PurchaseWhat))[1],
pw = expand.grid(unique(df$WorkPlan), unique(df$PurchaseWhat))[2]))
names(pred_data) <- c("YearsCodePro", "WorkPlan", "PurchaseWhat")
pred_data$pred <- predict(model, newdata = pred_data, type = "response")
pred_data

## YearsCodePro WorkPlan PurchaseWhat pred
## 1 -0.007079771 -lowStrct -medInfl 0.6005853
## 2 -0.007079771 -medStrct -medInfl 0.6230403
## 3 -0.007079771 -highStrct -medInfl 0.6605831
## 4 -0.007079771 Non-responce -medInfl 0.5696743
## 5 -0.007079771 -lowStrct -lowInfl 0.5833346
## 6 -0.007079771 -medStrct -lowInfl 0.6061234
## 7 -0.007079771 -highStrct -lowInfl 0.6443894
## 8 -0.007079771 Non-responce -lowInfl 0.5520843
## 9 -0.007079771 -lowStrct Non-responce 0.6229198
## 10 -0.007079771 -medStrct Non-responce 0.6448616
## 11 -0.007079771 -highStrct Non-responce 0.6813427
## 12 -0.007079771 Non-responce Non-responce 0.5925647
## 13 -0.007079771 -lowStrct -highInfl 0.6476639
## 14 -0.007079771 -medStrct -highInfl 0.6689312
## 15 -0.007079771 -highStrct -highInfl 0.7040744
## 16 -0.007079771 Non-responce -highInfl 0.6180795
Here I see that all predicted probabilities across the different group combinations (with centered YearsCodePro at its mean) are above 0.5, which is not really good.
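Any cell of that table can be rebuilt by hand from the odds ratios in the summ() table: sum the log-odds contributions and invert the logit. Because the table values are rounded, the result is only approximate.

```r
# Rebuilding one predicted probability by hand from the rounded odds
# ratios in the summary table (so the result is only approximate):
b0          <- log(1.40)   # intercept, back on the log-odds scale
b_highStrct <- log(1.29)   # WorkPlan-highStrct
b_highInfl  <- log(1.31)   # PurchaseWhat-highInfl
eta <- b0 + b_highStrct + b_highInfl   # YearsCodePro is ~0 (centered mean)
plogis(eta)   # about 0.70, close to row 15 of the table above
```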
For different YearsCodePro values:
With WorkPlan groups
library(ggplot2)
ggplot(pred_data3, aes(x = YearsCodePro, y = PredictedProb)) +
geom_ribbon(aes(ymin = LL, ymax = UL, fill = WorkPlan), alpha = 0.2) +
geom_line(aes(colour = WorkPlan), size = 0.75) +
ylim(0, 1)

The groups do not differ much from each other, but the high-structuredness group does sit above the others, with medium structuredness just below and low structuredness at the bottom (except for the “Non-responce” group). Those slopes help my hypothesis: the more structured the respondent’s work, the more likely they are to think their life is better than their parents’.
But my hypothesis breaks down for years of professional coding: the probability decreases as the years increase.
With PurchaseWhat groups
ggplot(pred_data3, aes(x = YearsCodePro, y = PredictedProb)) +
geom_ribbon(aes(ymin = LL, ymax = UL, fill = PurchaseWhat), alpha = 0.2) +
geom_line(aes(colour = PurchaseWhat), size = 0.75) +
ylim(0, 1)

Here the situation is the same: the slopes for purchasing influence help my hypothesis, but the chances decrease with more years of professional coding.
Average marginal effects
## factor AME SE z p lower upper
## PurchaseWhat-highInfl 0.0625 0.0067 9.3303 0.0000 0.0494 0.0756
## PurchaseWhat-medInfl 0.0168 0.0054 3.0921 0.0020 0.0062 0.0275
## PurchaseWhatNon-responce 0.0385 0.0075 5.1044 0.0000 0.0237 0.0533
## WorkPlan-highStrct 0.0593 0.0067 8.8503 0.0000 0.0462 0.0725
## WorkPlan-medStrct 0.0222 0.0054 4.1365 0.0000 0.0117 0.0327
## WorkPlanNon-responce -0.0306 0.0111 -2.7465 0.0060 -0.0524 -0.0088
## YearsCodePro -0.0052 0.0003 -16.4590 0.0000 -0.0058 -0.0046
- if centered YearsCodePro changes by one unit, the predicted probability decreases by about 0.005 (0.5 percentage points)
- if PurchaseWhat changes from “low” (controlling for YearsCodePro):
  - to “medium”: the probability is higher by about 1.7 percentage points
  - to “high”: by about 6.3 percentage points
  - to “Non-responce”: by about 3.9 percentage points
- if WorkPlan changes from “low” (controlling for YearsCodePro):
  - to “medium”: the probability is higher by about 2.2 percentage points
  - to “high”: by about 5.9 percentage points
  - to “Non-responce”: it is lower by about 3.1 percentage points

And here I finally see that my hypotheses about structuredness and influence were right, but the one about years of professional coding was not!
Accuracy, ROC and Osius and Rojek’s test
# classify with a 0.5 threshold and align factor levels with the outcome
pred <- factor(ifelse(predict(model, newdata = df.test, type = "response") > 0.5,
                      "Yes", "No"),
               levels = c("No", "Yes"))
# levels(df.test$BetterLife)
library(caret)
library(e1071)
confusionMatrix(table(data=pred, df.test$BetterLife))

## Confusion Matrix and Statistics
##
##
## data No Yes
## No 128 106
## Yes 5653 9608
##
## Accuracy : 0.6283
## 95% CI : (0.6207, 0.6359)
## No Information Rate : 0.6269
## P-Value [Acc > NIR] : 0.3607
##
## Kappa : 0.0139
##
## Mcnemar's Test P-Value : <2e-16
##
## Sensitivity : 0.022141
## Specificity : 0.989088
## Pos Pred Value : 0.547009
## Neg Pred Value : 0.629579
## Prevalence : 0.373088
## Detection Rate : 0.008261
## Detection Prevalence : 0.015102
## Balanced Accuracy : 0.505615
##
## 'Positive' Class : No
##
With this accuracy and no-information rate (p = 0.36 for Acc > NIR) I cannot claim that my model is statistically significantly better than no model at all.
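The headline numbers can be recomputed directly from the matrix above, which makes the problem obvious: the model predicts “Yes” almost everywhere, so its accuracy barely beats always predicting “Yes”.

```r
# Accuracy and no-information rate recomputed from the confusion matrix
tab <- matrix(c(128, 5653, 106, 9608), nrow = 2,
              dimnames = list(pred = c("No", "Yes"), obs = c("No", "Yes")))
accuracy <- sum(diag(tab)) / sum(tab)      # correct predictions / total
nir      <- max(colSums(tab)) / sum(tab)   # always predicting "Yes"
round(c(accuracy = accuracy, NIR = nir), 4)   # 0.6283 and 0.6269
```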
Looking at ROC curve:
AUC = 55.3% shows that my model is “useless” according to the test summary itself:
## auc lower 95% CI upper 95% CI
## 55.33457 54.79547 55.87368
## attr(,"interpret")
## [1] "auc = 0.5 --> useless" "0.7 < auc < 0.8 --> good"
## [3] "0.8 < auc < 0.9 --> excellent"
Testing for a significant difference between the model and the observed data:
## test stat val df pVal
## 1: HL chiSq 8.82096889 8 0.35762333
## 2: mHL F 0.55712121 9 0.83235596
## 3: OsRo Z 2.01124177 NA 0.04429993
## 4: SstPgeq0.5 Z 0.22523135 NA 0.82179931
## 5: SstPl0.5 Z 0.52881251 NA 0.59693552
## 6: SstBoth chiSq 0.33037183 2 0.84773608
## 7: SllPgeq0.5 chiSq 0.05072991 1 0.82179802
## 8: SllPl0.5 chiSq 0.28142922 1 0.59576598
## 9: SllBoth chiSq 0.57259517 2 0.75103909
The goodness-of-fit tests show:
- Hosmer and Lemeshow: p = 0.36
- Osius and Rojek: p = 0.04
The tests disagree; I lean toward Osius and Rojek’s result (p = 0.04), which, together with the other diagnostics, indicates that my model fits poorly.
Model diagnostics
Linearity
## Test stat Pr(>|Test stat|)
## YearsCodePro 0.0578 0.8099
## WorkPlan
## PurchaseWhat
The empty rows are expected here: this linearity check applies only to the continuous predictor, so the factor variables produce no test statistic. For YearsCodePro the p-value is 0.81, so there is no evidence of nonlinearity.
Multicollinearity
## GVIF Df GVIF^(1/(2*Df))
## YearsCodePro 1.037050 1 1.018356
## WorkPlan 1.244897 3 1.037183
## PurchaseWhat 1.280756 3 1.042104
I don’t have any VIF > 10 (all values are near 1), that’s good!
Outliers
## No Studentized residuals with Bonferroni p < 0.05
## Largest |rstudent|:
## rstudent unadjusted p-value Bonferroni p
## 7009 -1.634313 0.10219 NA
Some points could be outliers, but I don’t think that changes my outcome: I already have a poorly fitting model :(
Cook’s distance
## StudRes Hat CookD
## 19488 -1.0022709 0.0009808869 8.009260e-05
## 49428 1.3146992 0.0008639994 1.484042e-04
## 7009 -1.6343127 0.0002356230 8.252462e-05
## 61839 1.3462571 0.0008809057 1.625163e-04
## 56802 -1.6343127 0.0002356230 8.252462e-05
## 8906 -0.9847613 0.0010307213 8.049172e-05
No observation has a large Cook’s distance, so there are no influential points, that’s great!
Conclusion
That was a really interesting topic; I like logistic regression and will use it in the future (just because it is interesting).
In the end I partly confirmed my hypotheses about structuredness and influence, but disproved the one about years of professional coding. With such effect sizes, pseudo-R² values and accuracy, though, I can’t claim that I’ve built a good model; the data was also interesting, but strange.
Thanks for the topic!